Packages and Data

# Libraries

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(patchwork))
sleep <- read.csv("cmu-sleep.csv")

Introduction

Motivation

It is important for students to succeed during their first year of college for numerous reasons. However, the transition to college life presents many challenges, which can often result in the compromising of a student’s sleep. Sleep is crucial to cognitive function, so a reduction in sleep could threaten a student’s ability to succeed in their first year of college. We hypothesize that better sleep could lead to a higher GPA, and suggest that university policy and student behavior can be adjusted accordingly to enhance academic outcomes.

Overall Theme

The overarching aim is to discern the relationship between sleep habits and academic success during the pivotal first year of college. This exploration is structured into three distinct sections:

- Firstly, the connection between sleep duration and academic achievement is examined.
- Secondly, we consider whether sleep variability, as an indicator of sleep quality, might also correlate with GPA.
- Lastly, we identify if these relationships hold true when accounting for demographic factors like race, gender, and first-generation college status.

Exploratory Data Analysis

Overview

Our data was collected from a study on the CMU Data Repository that surveyed first-year students from Carnegie Mellon University (CMU), the University of Washington (UW), and Notre Dame University (ND). In total, 634 students participated in the survey. Students received a Fitbit to track their sleep and physical activity. Additionally, their GPAs were collected from their university’s registrar.

The dataset features variables such as subject ID, study number, cohort, and demographic details, alongside sleep-related metrics like bedtime variability and total sleep time. The investigation examines the potential influence of sleep duration on academic performance, specifically changes in the end-of-semester grade point average (GPA) among first-year college students.

We have five categorical variables and ten quantitative variables, as displayed below. Some variables describe features of the student such as race and gender; some describe the student’s sleeping habits; and some describe the student’s academic performance.

| Descriptive Variable | Description |
|---|---|
| Subject ID | Unique ID of the subject. |
| Study | Study number (corresponding to the last table). |
| Cohort | Codename of the cohort that the subject belongs to. |
| Race | Binary label for underrepresented and non-underrepresented students (underrepresented = 0, non-underrepresented = 1). |
| Gender | Gender of the subject (male = 0, female = 1), as reported by their institution. |
| First Generation | First-generation status (non-first gen = 0, first-gen = 1). |

| Sleep-Related Metric | Description |
|---|---|
| Bedtime MSSD | Mean successive squared difference of bedtime. This measures bedtime variability, and is calculated as the average of the squared difference of bedtime on consecutive nights. |
| Total Sleep Time | Average time in bed (the difference between wake time and bedtime) minus the length of total awake/restlessness in the main sleep episode, in minutes. |
| Midpoint Sleep | Average midpoint of bedtime and wake time, in minutes after 11 pm. |
| Fraction Nights with Data | Fraction of nights with captured data for the subject. |
| Daytime Sleep | Average sleep time outside of the range of the main sleep episode, in minutes. |

| Academic Performance Metric | Description |
|---|---|
| Cumulative GPA | Cumulative GPA (out of 4.0), for semesters before the one being studied. |
| Term GPA | End-of-term GPA (out of 4.0) for the semester being studied. |
| Term Units | Number of course units carried in the term. |
| Term Units (Adjusted) | Term Units adjusted for mean of 0 and standard deviation of 1. |

| Study | University | Semester |
|---|---|---|
| 1 | Carnegie Mellon University | Spring 2018 |
| 2 | University of Washington | Spring 2018 |
| 3 | University of Washington | Spring 2019 |
| 4 | Notre Dame University | Spring 2016 |
| 5 | Carnegie Mellon University | Spring 2017 |
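
To make the Bedtime MSSD definition above concrete, here is a minimal sketch of how a mean successive squared difference could be computed for one subject. The function name `mssd` and the vector `bedtimes` are hypothetical and used only for illustration; the dataset ships the metric precomputed, and the units of the original calculation are not restated here.

```r
# Mean successive squared difference (MSSD) of a numeric series.
# Pairs of consecutive nights with a missing value are skipped.
mssd <- function(x) {
  if (length(x) < 2) return(NA_real_)
  mean(diff(x)^2, na.rm = TRUE)  # average squared night-to-night change
}

# Hypothetical bedtimes on four consecutive nights (hours after a fixed reference).
bedtimes <- c(0.5, 1.0, 0.0, 2.0)
mssd(bedtimes)
```

A subject who goes to bed at the same time every night has an MSSD of 0; larger values indicate more night-to-night variation.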

Correlation Matrix

First, the following code chunk generates the correlation matrix for continuous variables such as sleep times, GPA, and other numerical metrics to see how strongly these variables are related.

sleep_continuous <- sleep[c('bedtime_mssd', 'TotalSleepTime', 'midpoint_sleep', 
                           'frac_nights_with_data', 'daytime_sleep', 'cum_gpa', 
                           'term_gpa')]

cor_matrix <- cor(sleep_continuous, use = "complete.obs")  

print(cor_matrix)
##                       bedtime_mssd TotalSleepTime midpoint_sleep
## bedtime_mssd           1.000000000     -0.1378871     0.41007395
## TotalSleepTime        -0.137887141      1.0000000    -0.33204303
## midpoint_sleep         0.410073955     -0.3320430     1.00000000
## frac_nights_with_data -0.444754051      0.1151740    -0.29670431
## daytime_sleep          0.081458938     -0.2925153     0.08864347
## cum_gpa               -0.006016101      0.1103745    -0.19142135
## term_gpa              -0.035991253      0.2016771    -0.19454357
##                       frac_nights_with_data daytime_sleep      cum_gpa
## bedtime_mssd                    -0.44475405    0.08145894 -0.006016101
## TotalSleepTime                   0.11517399   -0.29251526  0.110374482
## midpoint_sleep                  -0.29670431    0.08864347 -0.191421349
## frac_nights_with_data            1.00000000   -0.06463782  0.044623099
## daytime_sleep                   -0.06463782    1.00000000 -0.143174723
## cum_gpa                          0.04462310   -0.14317472  1.000000000
## term_gpa                         0.07412054   -0.15302999  0.638035220
##                          term_gpa
## bedtime_mssd          -0.03599125
## TotalSleepTime         0.20167715
## midpoint_sleep        -0.19454357
## frac_nights_with_data  0.07412054
## daytime_sleep         -0.15302999
## cum_gpa                0.63803522
## term_gpa               1.00000000
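
The call above uses `use = "complete.obs"`, which drops any row with a missing value in any of the selected columns before computing every correlation. The sketch below contrasts this with pairwise deletion on a small made-up data frame (`toy` and its values are assumptions for illustration, not part of the sleep data).

```r
# Toy data frame (hypothetical values) with one missing entry in column c.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),
  c = c(5, NA, 3, 2, 1)
)

# Listwise deletion: row 2 is removed for every pair, including a-b.
cor_complete <- cor(toy, use = "complete.obs")

# Pairwise deletion: each pair uses all rows where both columns are present.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Number of rows actually used under listwise deletion.
n_used <- sum(complete.cases(toy))
```

Listwise deletion keeps every entry of the matrix based on the same set of rows, at the cost of discarding rows that are only partially missing.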

Univariate Distribution Analysis

In the following code chunk, we plot histograms and density plots to explore the distributions of two key variables, TotalSleepTime and term_gpa.

sleep_TotalSleepTime <- sleep %>%
  ggplot(aes(x = TotalSleepTime)) +
  labs(title = "Total Sleep Time Distribution",
       y = "Density",
       x = "Total Sleep Time (minutes)") +
  geom_histogram(aes(y = after_stat(density)),
                 color = "deepskyblue4", 
                 fill = "deepskyblue",
                 binwidth = 11.59) +
  geom_density(fill = "deeppink",
               alpha = 0.2)

sleep_term_GPA <- sleep %>%
  ggplot(aes(x = term_gpa)) +
  labs(title = "Term GPA Distribution",
       y = "Density",
       x = "Term GPA") +
  geom_histogram(aes(y = after_stat(density)),
                 color = "darkorchid4", 
                 fill = "darkorchid1",
                 binwidth = 0.1066) +
  geom_density(fill = "dodgerblue",
               alpha = 0.2)

sleep_TotalSleepTime + sleep_term_GPA

Total sleep time appears to be approximately normally distributed, centered around 400 minutes (roughly 6.7 hours). There are some outliers at both the low and high ends, but the bulk of the data falls within the normal range. The overlaid density curve is consistent with this bell shape.

The histogram for Term GPA shows that the data skews left, indicating that more students have a GPA closer to 4.0 than to the lower end of the scale. There’s a clear peak around the GPA of 3.5. The density plot overlays the histogram, providing a smoothed curve representation of the distribution, emphasizing the skew towards higher GPAs.
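
The left skew described above can also be checked numerically. The following is a minimal sketch of a moment-based skewness coefficient; the helper `skewness` and the vector `gpa_like` are hypothetical and not part of the original analysis. Negative values indicate a longer left tail.

```r
# Moment-based sample skewness: negative values indicate a left skew.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

# Hypothetical GPA-like values bunched near the top of a 4.0 scale.
gpa_like <- c(3.9, 3.8, 3.7, 3.6, 3.5, 3.4, 3.0, 2.2)
skewness(gpa_like)           # negative for this left-skewed example
skewness(c(1, 2, 3, 4, 5))   # zero for a symmetric sample
```

Under these assumptions, applying the same helper to `sleep$term_gpa` would give a single number to report alongside the histogram.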

Bivariate Quantitative

sleep %>%
  ggplot() +
  geom_point(data = subset(sleep, daytime_sleep <= 77), 
             aes(x = midpoint_sleep, y = TotalSleepTime, color = daytime_sleep), 
             alpha = 0.85) +
  scale_color_gradient2("Average \nDaytime Sleep \n(minutes)", 
                        low = "red", mid = "orange", high = "blue",
                        midpoint = median(sleep$daytime_sleep)) +
  geom_point(data = subset(sleep, daytime_sleep > 77),
             aes(x = midpoint_sleep, y = TotalSleepTime), 
             alpha = 0.85) +
  labs(title = "Total Sleep Time vs. Midpoint of Sleep",
       x = "Midpoint of Sleep (minutes after 11pm)",
       y = "Total Sleep Time (minutes)")

sleep %>%
  ggplot(aes(x = bedtime_mssd, y = midpoint_sleep, color = Zterm_units_ZofZ)) +
  scale_color_gradient2(low = "red", mid = "orange", high = "blue") +
  geom_point(alpha = 0.95)

sleep %>%
  ggplot(aes(x = frac_nights_with_data, y = cum_gpa, color = daytime_sleep)) +
  scale_color_gradient2(low = "red", mid = "orange", high = "blue") +
  geom_point(alpha = 0.5)

sleep$cat_race <- as.factor(sleep$demo_race)
sleep$cat_gender <- as.factor(sleep$demo_gender)
sleep$cat_firstgen <- as.factor(sleep$demo_firstgen)
sleep %>%
  ggplot(aes(x = cat_gender, fill = cat_race)) +
  facet_wrap(~ cat_firstgen) +
  geom_bar(position = "dodge")

library(ggridges)

sleep %>%
  ggplot(aes(y = cat_gender, x = cum_gpa)) +
  facet_grid( ~ cat_race) +
  ggridges::geom_density_ridges(rel_min_height = 0.01,
                                alpha = 0.75,
                                aes(fill = cat_firstgen))
## Picking joint bandwidth of 0.232
## Picking joint bandwidth of 0.129
## Picking joint bandwidth of NaN
## Warning in FUN(X[[i]], ...): no non-missing arguments to max; returning -Inf

sleep %>%
  ggplot(aes(x = daytime_sleep, y = term_gpa, color = Zterm_units_ZofZ)) +
  geom_point(alpha = 0.75) +
  scale_color_gradient2("Term Units \n(Z-Score)",
                        limit = c(-2, 2),
                        low = "red", mid = "green", high = "blue", 
                        na.value = rgb(1, 0.96, 0.83, alpha = 0.001)) +
  labs(title = "Term GPA vs. Daytime Sleep",
       x = "Daytime Sleep (minutes)",
       y = "Term GPA") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

sleep %>%
  ggplot(aes(x = TotalSleepTime, y = term_gpa, color = midpoint_sleep)) +
  scale_color_gradient(low = "orange", high = "blue") +
  geom_point(alpha = 0.5)

sleep %>%
  ggplot(aes(x = Zterm_units_ZofZ, y = term_gpa, color = TotalSleepTime)) +
  geom_point(alpha = 0.85) +
  labs(title = "Term GPA vs. Term Units",
       x = "Term Units (Z-Score)",
       y = "Term GPA") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 147 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## Warning: Removed 147 rows containing missing values (`geom_point()`).

sleep <- sleep %>%
  subset(daytime_sleep < 250)

summary(lm(cum_gpa ~ daytime_sleep, data = sleep))
## 
## Call:
## lm(formula = cum_gpa ~ daytime_sleep, data = sleep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2921 -0.2272  0.0860  0.3061  0.7084 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.561064   0.032507 109.547  < 2e-16 ***
## daytime_sleep -0.002322   0.000676  -3.434 0.000633 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4337 on 631 degrees of freedom
## Multiple R-squared:  0.01835,    Adjusted R-squared:  0.01679 
## F-statistic:  11.8 on 1 and 631 DF,  p-value: 0.0006328
filter(sleep, cohort == "nh")$Zterm_units_ZofZ %>%
  is.na() %>%
  sum()
## [1] 147
sleep_quant <- sleep %>%
  filter(!(cohort == "nh")) %>%
  select(!c(subject_id, study, cohort, 
            demo_race, demo_gender, demo_firstgen,
            cat_race, cat_gender, cat_firstgen))



sleep_pca <- prcomp(sleep_quant, center = TRUE, scale. = TRUE)
summary(sleep_pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.4769 1.2763 1.1630 1.0107 0.93478 0.79195 0.73231
## Proportion of Variance 0.2424 0.1810 0.1503 0.1135 0.09709 0.06969 0.05959
## Cumulative Proportion  0.2424 0.4233 0.5736 0.6871 0.78423 0.85392 0.91350
##                            PC8     PC9
## Standard deviation     0.64686 0.60005
## Proportion of Variance 0.04649 0.04001
## Cumulative Proportion  0.95999 1.00000
suppressPackageStartupMessages(library(factoextra))

fviz_eig(sleep_pca, choice = "variance", ncp = 9, addlabels = TRUE) +
  geom_hline(yintercept = 100 * (1 / ncol(sleep_quant)))
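
The horizontal line in the scree plot above sits at 100 / p percent, where p is the number of variables: the share each component would explain if variance were spread evenly across all standardized variables. Here is a minimal sketch of that comparison on simulated data (the matrix `sim` and all names below are assumptions for illustration, not the sleep dataset).

```r
set.seed(1)
# Simulated data (not the sleep dataset): 100 rows, 4 numeric columns.
sim <- matrix(rnorm(400), ncol = 4)
sim[, 2] <- sim[, 1] + rnorm(100, sd = 0.3)  # make two columns correlated

pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Percent of variance explained by each component.
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Equal-share reference line: 100 / p percent.
threshold <- 100 / ncol(sim)

# Components that explain more than an equal share of the variance.
keep <- which(pct_var > threshold)
```

Components above the line carry more information than a single original variable would on average, which is one common heuristic for deciding how many to retain.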

sleep_nh <- sleep %>%
  filter(!(cohort == "nh"))

sleep_nh <- sleep_nh %>%
  mutate(pc1 = sleep_pca$x[,1],
         pc2 = sleep_pca$x[,2],
         pc3 = sleep_pca$x[,3])

sleep_nh %>%
  ggplot(aes(x = pc1, y = pc2)) +
  labs(title = "PCA Plot for Sleep",
       x = "PC 1",
       y = "PC 2") +
  geom_point(aes(color = as.factor(study)), alpha = 0.75)

fviz_pca_biplot(sleep_pca, label = "var",
                alpha.ind = 0.25,
                alpha.var = 0.75,
                repel = TRUE,
                # Color the points by cohort:
                habillage = sleep_nh$cohort, pointshape = 19)

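# Standardize each column by its standard deviation
sleep_std <- sleep_quant %>%
  scale(center = FALSE,
        scale = apply(sleep_quant, 2, sd, na.rm = TRUE))

# Euclidean distance matrix on the standardized data
# (the data arrives through the pipe, so it is not passed again)
sleep_dist <- sleep_std %>%
  dist(method = "euclidean")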

plot(as.dendrogram(hclust(sleep_dist, method = "single")))

plot(as.dendrogram(hclust(sleep_dist, method = "complete")))

suppressPackageStartupMessages(library(dendextend))

sleep_complete_dend <- as.dendrogram(hclust(sleep_dist, method = "complete"))
sleep_complete_dend <- set(sleep_complete_dend, "branches_k_color", k=5)

plot(sleep_complete_dend)

# Use sleep_nh so the colors line up with the rows used to build the dendrogram
sleep_colors <- ifelse(sleep_nh$study == 5, "red",
                       ifelse(sleep_nh$study == 4, "orange",
                              ifelse(sleep_nh$study == 3, "gold",
                                     ifelse(sleep_nh$study == 2, "green", "blue"))))

plot(set(sleep_complete_dend, "labels_colors", 
         order_value = TRUE, sleep_colors))

Bivariate Exploratory Data Analysis